Long Story Short: a Summarize-then-Search Method for Long Video Question Answering
Large language models such as GPT-3 have demonstrated an impressive
capability to adapt to new tasks without requiring task-specific training data.
This capability has been particularly effective in settings such as narrative
question answering, where the diversity of tasks is immense but supervision
data are scarce. In this work, we investigate whether such language models
can extend their zero-shot reasoning abilities to long multimodal narratives in
multimedia content such as drama, movies, and animation, where the story plays
an essential role. We propose Long Story Short, a framework for narrative video
QA that first summarizes the narrative of the video to a short plot and then
searches parts of the video relevant to the question. We also propose to
enhance visual matching with CLIPCheck. Our model outperforms state-of-the-art
supervised models by a large margin, highlighting the potential of zero-shot QA
for long videos.
Comment: Published in BMVC 2023
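The abstract above outlines the summarize-then-search control flow at a high level; the toy sketch below illustrates only that flow. All function names and the word-overlap retriever are placeholder assumptions, not the paper's actual GPT-3 summarizer, search procedure, or CLIPCheck module.

```python
# Minimal sketch of a summarize-then-search QA loop (hypothetical helpers; the real
# pipeline uses a large language model for summarization and CLIP for visual checking).

def summarize(clip_captions: list[str]) -> list[str]:
    """Stand-in for an LLM call that compresses per-clip captions into short plot sentences."""
    return [c.split(".")[0] for c in clip_captions]  # placeholder: keep the first sentence

def search_relevant_parts(plot: list[str], question: str, top_k: int = 2) -> list[int]:
    """Rank plot sentences by naive word overlap with the question (placeholder retriever)."""
    q_words = set(question.lower().split())
    scores = [len(q_words & set(s.lower().split())) for s in plot]
    return sorted(range(len(plot)), key=lambda i: scores[i], reverse=True)[:top_k]

def answer(clip_captions: list[str], question: str) -> str:
    plot = summarize(clip_captions)
    hits = search_relevant_parts(plot, question)
    # A real system would feed the retrieved video parts (plus frames verified by a
    # CLIP-style visual check) back to the language model; here we return the best match.
    return plot[hits[0]]

captions = [
    "Alice finds a letter in the attic. She hides it from Bob.",
    "Bob asks Alice about the attic. She denies going there.",
]
print(answer(captions, "What did Alice find in the attic?"))
```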
Learning Joint Representation of Human Motion and Language
In this work, we present MoLang (a Motion-Language connecting model) for
learning joint representation of human motion and language, leveraging both
unpaired and paired datasets of motion and language modalities. To this end, we
propose a motion-language model with contrastive learning, empowering our model
to learn better generalizable representations of the human motion domain.
Empirical results show that our model learns strong representations of human
motion data by leveraging the language modality. The proposed method performs
both action recognition and motion retrieval with a single model, outperforming
state-of-the-art approaches on a number of action recognition benchmarks.
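Since the abstract attributes MoLang's representations to contrastive learning over paired motion and language data, the sketch below shows one plausible instantiation: a CLIP-style symmetric contrastive loss. The embedding size, batch size, and temperature are illustrative assumptions, not MoLang's actual configuration.

```python
# A minimal symmetric contrastive (InfoNCE-style) loss between motion and text embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(motion_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # Normalize so cosine similarity reduces to a dot product.
    motion_emb = F.normalize(motion_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = motion_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))             # paired samples lie on the diagonal
    # Symmetric cross-entropy aligns motion->text and text->motion.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```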
CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos
Visual information is central to conversation: body gestures and physical
behaviour, for example, contribute to meaning that transcends words alone. To
date, however, most neural conversational models are limited to just text. We
introduce CHAMPAGNE, a generative model of conversations that can account for
visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a
large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from
web videos: crucial to our data collection pipeline is a pretrained language
model that converts error-prone automatic transcripts to a cleaner dialogue
format while maintaining meaning. Human evaluation reveals that YTD-18M is more
sensible and specific than prior resources (MMDialog, 1M dialogues), while
maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE
learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it
achieves state-of-the-art results on four vision-language tasks focused on
real-world conversations. We release data, models, and code.
Comment: ICCV 2023, Project page: https://seungjuhan.me/champagn
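The key data-collection step described above is rewriting noisy automatic transcripts into a cleaner dialogue format with a pretrained language model. The sketch below shows only the shape of that step; rewrite_with_lm is a hypothetical stub standing in for whatever model the actual YTD-18M pipeline queries, and the prompt wording is an assumption.

```python
# Illustrative sketch: a language model rewrites an unpunctuated ASR transcript
# into speaker-turn dialogue while preserving meaning.

def rewrite_with_lm(prompt: str) -> str:
    """Placeholder LM call; a real pipeline would query a pretrained language model."""
    return "A: So what do you think of the new phone?\nB: Honestly, I love the camera."

def transcript_to_dialogue(asr_transcript: str) -> list[dict]:
    prompt = (
        "Rewrite this noisy video transcript as a two-speaker dialogue, "
        "keeping the original meaning:\n" + asr_transcript
    )
    cleaned = rewrite_with_lm(prompt)
    turns = []
    for line in cleaned.splitlines():
        speaker, _, utterance = line.partition(":")
        turns.append({"speaker": speaker.strip(), "text": utterance.strip()})
    return turns

print(transcript_to_dialogue("so what do you think of the new phone honestly i love the camera"))
```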
Active Visual Search in the Wild
In this paper, we focus on the problem of efficiently locating a target
object described with free-form language using a mobile robot equipped with
vision sensors (e.g., an RGBD camera). Conventional active visual search
predefines a set of objects to search for, rendering these techniques
restrictive in practice. To provide added flexibility in active visual
searching, we propose a system where a user can enter target commands using
free-form language; we call this system Active Visual Search in the Wild
(AVSW). AVSW detects and plans a search for the user-specified target object
over a semantic grid map represented by static landmarks (e.g., desk or
bed). For efficient planning of object search patterns, AVSW considers
commonsense knowledge-based co-occurrence and predictive uncertainty while
deciding which landmarks to visit first. We validate the proposed method with
respect to SR (success rate) and SPL (success weighted by path length) in both
simulated and real-world environments. The proposed method outperforms previous
methods in terms of SPL in simulated scenarios with an average gap of 0.283. We
further demonstrate AVSW with a Pioneer-3AT robot in real-world studies.
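For reference, SPL (success weighted by path length), the metric reported above, is commonly computed as the per-episode success flag discounted by the ratio of the shortest-path length to the agent's actual path length; the small sketch below follows that standard definition (the example numbers are illustrative, not results from the paper).

```python
# Success weighted by Path Length (SPL): success is discounted by how much longer
# the agent's traveled path is than the shortest path to the target.

def spl(successes, shortest_lengths, agent_lengths):
    """Per-episode lists: success flag (0/1), shortest-path length, agent path length."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, agent_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Example: two episodes, one success with a detour and one failure.
print(spl([1, 0], [4.0, 6.0], [5.0, 9.0]))  # -> 0.4
```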